Revisiting Priorities: Improving MIR Evaluation Practices

Author

  • Bob L. Sturm
Abstract

While there is a consensus that evaluation practices in music informatics (MIR) must be improved, there is no consensus about what should be prioritised in order to do so. Priorities include: 1) improving data; 2) improving figures of merit; 3) employing formal statistical testing; 4) employing cross-validation; and/or 5) implementing transparent, central and immediate evaluation. In this position paper, I argue that these priorities treat only the symptoms of the problem and not its cause: MIR lacks a formal evaluation framework relevant to its aims. I argue that the principal priority is to adapt and integrate the formal design of experiments (DOE) into the MIR research pipeline. Since the aim of DOE is to help one produce the most reliable evidence at the least cost, it stands to reason that DOE will make a significant contribution to MIR. Accomplishing this, however, will not be easy, and will require far more effort than is currently being devoted to it.

1. CONSENSUS: WE NEED BETTER PRACTICES

I recall the aims of MIR research in Sec. 1.1, and the importance of evaluation to this pursuit. With respect to these, I describe the aims and shortcomings of MIREX in Sec. 1.2. These motivate the seven evaluation challenges of the MIR "Roadmap" [23], summarised in Sec. 1.3. In Sec. 1.4, I review a specific kind of task that is representative of a major portion of MIR research, and in Sec. 1.5 I look at one implementation of it. This leads to testing a causal model in Sec. 1.6, and the risk of committing the sharpshooter fallacy, which I describe in Sec. 1.7.

1.1 The aims of MIR research

MIR research aims to connect real-world users with music and information about music, and to help users make music and information about music [3]. One cannot overstate the importance of relevant and reliable evaluation to this pursuit:¹ we depend upon it to measure the effectiveness of our algorithms and systems, compare them against the state of the art, chart the progress of our discipline, and discriminate promising directions from dead ends. The design of any machine listening system² involves a series of complex decisions, and so we seek the best evidence to guide this process. This motivates the largest contribution so far to evaluation methodology in MIR research: the Music Information Retrieval Evaluation eXchange (MIREX) [9].

¹ An "evaluation" is a protocol for testing a hypothesis or estimating a quantity. An evaluation is "relevant" if it logically addresses the investigated hypothesis or quantity. An evaluation is "reliable" if it results in a repeatable and statistically sound conclusion.

© Bob L. Sturm. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Bob L. Sturm, "Revisiting Priorities: Improving MIR Evaluation Practices", 17th International Society for Music Information Retrieval Conference, 2016.
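
Priorities 3) and 4) in the abstract, formal statistical testing and cross-validation, are easy to illustrate. Below is a minimal sketch, not taken from the paper, of a common way two systems are compared: k-fold cross-validation followed by a paired t-test over the per-fold scores. The synthetic data and both classifiers are illustrative assumptions.

    # Minimal sketch (illustrative, not from the paper): cross-validated
    # comparison of two hypothetical systems with a paired significance test.
    from scipy import stats
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # Synthetic stand-in for an MIR dataset (e.g., track features and labels).
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    cv = KFold(n_splits=10, shuffle=True, random_state=0)  # same folds for both
    acc_a = cross_val_score(SVC(), X, y, cv=cv)
    acc_b = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)

    # Paired t-test over per-fold accuracies. The folds share training data
    # and so are not independent; the resulting p-value is optimistic.
    t, p = stats.ttest_rel(acc_a, acc_b)
    print(f"mean acc A={acc_a.mean():.3f}, B={acc_b.mean():.3f}, p={p:.3f}")

Even this textbook usage rests on violated assumptions (the per-fold scores are not independent), which is consistent with the paper's argument that adopting such practices piecemeal treats symptoms rather than the underlying lack of a formal evaluation framework.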
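
The paper's principal priority is DOE. As a rough sketch of what this could mean operationally, the snippet below enumerates a full factorial design over some experimental factors, with replication to estimate within-treatment variance. All factor names and levels are hypothetical, chosen only for illustration; the paper does not prescribe this code.

    # Hypothetical full factorial design in the spirit of DOE: every
    # combination of factors is run, so that main effects and interactions
    # can later be separated (e.g., by an analysis of variance).
    from itertools import product

    factors = {
        "feature_set": ["mfcc", "chroma"],           # hypothetical levels
        "classifier": ["svm", "knn"],
        "train_partition": ["artist_filtered", "random"],
    }

    # 2 x 2 x 2 = 8 treatments, each replicated with three random seeds.
    design = [
        dict(zip(factors, levels), seed=seed)
        for levels in product(*factors.values())
        for seed in range(3)
    ]

    for run in design:
        print(run)  # in practice: train and evaluate the system under this treatment

Because every treatment combination is present, such a design lets one attribute differences in a figure of merit to specific factors and their interactions, the kind of reliable evidence at least cost that the abstract attributes to DOE.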


Related articles

Setting Healthcare Priorities at the Macro and Meso Levels: A Framework for Evaluation

Background: Priority setting in healthcare is a key determinant of health system performance. However, there is no widely accepted priority setting evaluation framework. We reviewed literature with the aim of developing and proposing a framework for the evaluation of macro and meso level healthcare priority setting practices.

Methods: We systematically searched Econlit, PubMed, CINAHL, and EBSC...

Moving forward with complementary feeding: indicators and research priorities. International Food Policy Research Institute (IFPRI) discussion paper 146 (April 2003).

For a number of reasons, progress in improving child feeding practices in the developing world has been remarkably slow. First, complementary feeding practices encompass a number of interrelated behaviors that need to be addressed simultaneously. Child feeding practices are also age-specific within narrow age ranges, which adds to the complexity of developing recommendations and measuring respon...

Strategic objectives for improving understanding of informant discrepancies in developmental psychopathology research.

Developmental psychopathology researchers and practitioners commonly conduct behavioral assessments using multiple informants' reports (e.g., parents, teachers, practitioners, children, and laboratory observers). These assessments often yield inconsistent conclusions about important questions in developmental psychopathology research, depending on the informant (e.g., psychiatric diagnoses and ...

Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents

The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community, leading to some high-profile success stories such as the much publicized Deep Q-Networks (DQN). I...


Venue: 17th International Society for Music Information Retrieval Conference (ISMIR)

Publication year: 2016